*Note: a prior version of the files contained truncated rows in the embeddings. This was fixed as of 5/12/25. Please re-download if you downloaded the files prior to this date.

# Node and Edge files
- `ClinGraph_node.csv`: this contains all the node metadata and index information.
    - `node_id`: string identifier of a node. We concatenate the original alphanumeric code from the clinical vocabulary with the name of the clinical vocabulary. To get the original code, use `df['node_id'].str.split(':', expand=True)[0]`
    - `node name`: name of node as listed in the source clinical vocabulary
    - `ntype`: source clinical vocabulary
    - `node_index`: the unique integer identifier of each node. This is typically used when internally constructing the graph. 
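
As a minimal sketch of the `node_id` split described above, the snippet below uses a toy DataFrame in place of the real node file. The example identifiers and vocabulary labels (`PHECODE`, `ICD10CM`) are made up for illustration, and the `code` and `vocab` column names are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the node file; the values are illustrative only.
nodes = pd.DataFrame({
    "node_id": ["250.2:PHECODE", "E11.9:ICD10CM"],
    "node_index": [0, 1],
})

# Split each identifier on ':'; column 0 is the original code,
# column 1 is the vocabulary name that was appended to it.
parts = nodes["node_id"].str.split(":", expand=True)
nodes["code"] = parts[0]   # hypothetical column name
nodes["vocab"] = parts[1]  # hypothetical column name

print(nodes)
```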

- `ClinGraph_edges.csv`: this contains the triplet information used to construct the KG. We also include each node's metadata that's found in `ClinGraph_node.csv` for convenience. 
    - `edge_index`: the unique integer identifier of each edge. This is typically used when internally referring to an edge in the graph. 
    - `relationship`: either 'DEFINED' or 'ASSOC' (associated). 'DEFINED' means the relationship between the two nodes comes from a defined hierarchy or mapping (e.g. GEM mapping). 'ASSOC' means a broader association (e.g. 'treatment-of', 'product-of'), which typically stems from UMLS or PheMap. We do not model edge types but provide this information for convenience.
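
Although edge types are not modeled, the `relationship` column can be used to subset the edge list. The sketch below uses a toy DataFrame with made-up rows in place of `ClinGraph_edges.csv`; only the `edge_index` and `relationship` columns described above are assumed.

```python
import pandas as pd

# Toy rows standing in for ClinGraph_edges.csv; the values are illustrative only.
edges = pd.DataFrame({
    "edge_index": [0, 1, 2],
    "relationship": ["DEFINED", "ASSOC", "DEFINED"],
})

# Count how many edges carry each relationship label.
rel_counts = edges["relationship"].value_counts()

# Keep only edges that come from a defined hierarchy or mapping.
defined_edges = edges[edges["relationship"] == "DEFINED"]

print(rel_counts)
print(defined_edges)
```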


# Loading ClinGraph 

## DGL 

The model was originally developed using DGL. We store the node types and node features under the node data attribute. 

```
from dgl.data.utils import load_graphs
graph_list, _ = load_graphs("ClinGraph_dgl.bin")
g = graph_list[0]

# node features
print(g.ndata['feat'])

# node type (as indices)
print(g.ndata['ntype'])
```
## NetworkX

The graph can be read in as an adjacency list. This does not include node features. Those can be added using `ClinGraph_features.csv`, where each node has a 128-dim vector.

```
import networkx as nx
g = nx.read_adjlist("ClinGraph_adjlist.csv")
```
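
One possible way to attach features to the NetworkX graph is sketched below with toy data. The layout of `ClinGraph_features.csv` is not described here, so the sketch assumes that the row index of the feature table corresponds to the node label in the adjacency list; the attribute name `feat` is also a choice made for this example. Note that `nx.read_adjlist` returns node labels as strings unless `nodetype` is passed, so the keys are converted to strings.

```python
import networkx as nx
import numpy as np
import pandas as pd

# Toy graph standing in for the one read from ClinGraph_adjlist.csv.
# Like read_adjlist, parse_adjlist keeps node labels as strings by default.
g = nx.parse_adjlist(["0 1 2", "1 2"])

# Toy feature table: 3 nodes x 4 dims (the real vectors are 128-dim).
# ASSUMPTION: the row index corresponds to the node label.
feat_df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4))

# Map each string node label to its feature vector and store it on the node.
feats = {str(idx): row.to_numpy() for idx, row in feat_df.iterrows()}
nx.set_node_attributes(g, feats, name="feat")

print(g.nodes["2"]["feat"])
```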

## PyTorch Geometric

The graph can be read in as a PyTorch Geometric `Data` object. Node features are saved under the `x` attribute.
```
from torch_geometric.data import Data
import torch

g = torch.load('ClinGraph_pyg.pt', weights_only=False)

# node features
print(g.x)
```

# Loading ClinVec embeddings

We separate files by source vocabulary. Each set of embeddings is saved as a pandas dataframe where the 128 columns correspond to the dimensions of the embedding and the row index matches the `node_index`. 

```
import pandas as pd

# load phecode embeddings
df = pd.read_csv("ClinVec_phecode.csv")

# get matrix of embeddings
emb_mat = df.values

# get node metadata
node_df = pd.read_csv("ClinGraph_nodes.csv", sep='\t')
df['node_index'] = df.index
phecode_emb_df = df.merge(node_df, how='inner', on='node_index')
```